8 research outputs found

    Sampling arbitrary subgraphs exactly uniformly in sublinear time

    We present a simple sublinear-time algorithm for sampling an arbitrary subgraph H exactly uniformly from a graph G, which the algorithm accesses through the following types of queries: (1) uniform vertex queries, (2) degree queries, (3) neighbor queries, (4) pair queries and (5) edge sampling queries. The query complexity and running time of our algorithm are Õ(min{m, m^ρ(H)/#H}) and Õ(m^ρ(H)/#H), respectively, where m is the number of edges of G, ρ(H) is the fractional edge-cover of H and #H is the number of copies of H in G. For any clique on r vertices, i.e., H = K_r, our algorithm is almost optimal, as any algorithm that samples an H from any distribution that has Ω(1) total probability mass on the set of all copies of H must perform Ω(min{m, m^ρ(H)/(#H⋅(cr)^r)}) queries. Together with the query and time complexities of the (1±ε)-approximation algorithm for the number of subgraphs H by Assadi et al. [Sepehr Assadi et al., 2018] and the lower bound by Eden and Rosenbaum [Eden and Rosenbaum, 2018] for approximately counting cliques, our results suggest that in our query model, approximately counting cliques is "equivalent to" exactly uniformly sampling cliques, in the sense that the query and time complexities of exactly uniform sampling and randomized approximate counting are within a polylogarithmic factor of each other. This stands in interesting contrast to an analogous relation between approximate counting and almost uniform sampling for self-reducible problems in the polynomial-time regime by Jerrum, Valiant and Vazirani [Jerrum et al., 1986].
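To make the query model concrete, here is a minimal Python sketch of a toy oracle offering the five query types, plus a naive rejection sampler for triangles (H = K_3). This is not the algorithm from the abstract and it is not sublinear in general: it only illustrates why a sample can be exactly uniform. The class name `GraphOracle`, the function `sample_triangle`, and the in-memory storage are assumptions made for illustration, and the graph is assumed to be simple (no loops, no duplicate edges).

```python
import random


class GraphOracle:
    """Toy in-memory stand-in for the query model (hypothetical names)."""

    def __init__(self, n, edges):
        self.n = n
        self.edges = [tuple(e) for e in edges]  # each undirected edge stored once
        self.adj = {v: [] for v in range(n)}
        for u, v in self.edges:
            self.adj[u].append(v)
            self.adj[v].append(u)

    def uniform_vertex(self, rng):   # (1) uniform vertex query
        return rng.randrange(self.n)

    def degree(self, v):             # (2) degree query
        return len(self.adj[v])

    def neighbor(self, v, i):        # (3) i-th neighbor query
        return self.adj[v][i]

    def pair(self, u, v):            # (4) pair query: is {u, v} an edge?
        return v in self.adj[u]

    def sample_edge(self, rng):      # (5) edge sampling query
        return rng.choice(self.edges)


def sample_triangle(oracle, rng, max_tries=100_000):
    """Rejection sampling: draw a uniform edge and a uniform vertex,
    accept if they form a triangle. Every triangle is produced with the
    same probability 3 / (m * n) per attempt, so an accepted sample is
    exactly uniform over all triangles."""
    for _ in range(max_tries):
        u, v = oracle.sample_edge(rng)
        w = oracle.uniform_vertex(rng)
        if w != u and w != v and oracle.pair(u, w) and oracle.pair(v, w):
            return frozenset((u, v, w))
    return None  # no triangle found within the attempt budget
```

The expected number of attempts of this naive sampler is m·n/(3·#H), which is what a more careful algorithm would need to improve on to reach the bounds quoted above.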

    Testable properties in general graphs and random order streaming

    We present a novel framework closely linking the areas of property testing and data streaming algorithms in the setting of general graphs. It has been recently shown (Monemizadeh et al. 2017) that for bounded-degree graphs, any constant-query tester can be emulated in the random order streaming model by a streaming algorithm that uses only the space required to store a constant number of words. However, in the more natural setting of general graphs, with no restriction on the maximum degree, no such results were known, both because of our lack of understanding of constant-query testers in general graphs and because of a lack of techniques for emulating, in the streaming setting, off-line algorithms that allow many high-degree vertices. In this work we advance our understanding of both of these challenges. First, we provide canonical testers for all constant-query testers for general graphs, for both one-sided and two-sided errors. Such canonizations were previously known only for dense graphs in the adjacency matrix model (Goldreich and Trevisan 2003) and for bounded-degree (di-)graphs in the adjacency list model (Goldreich and Ron 2011, Czumaj et al. 2016). Using the concept of canonical testers, we then prove that every property of general graphs that is constant-query testable with one-sided error can also be tested in constant space with one-sided error in the random order streaming model. Our results imply, among others, that properties like (s,t)-disconnectivity, k-path-freeness, etc. are constant-space testable in random order streams.

    Every testable (infinite) property of bounded-degree graphs contains an infinite hyperfinite subproperty

    One of the most fundamental questions in graph property testing is to characterize the combinatorial structure of properties that are testable with a constant number of queries. We work towards an answer to this question for the bounded-degree graph model introduced in [GR02], where the input graphs have maximum degree bounded by a constant d. In this model, it is known (among other results) that every hyperfinite property is constant-query testable [NS13], where, informally, a graph property is hyperfinite if for every δ > 0 every graph in the property can be partitioned into small connected components by removing δn edges. In this paper we show that hyperfiniteness plays a role in every testable property, i.e., we show that every testable property is either finite (which trivially implies hyperfiniteness and testability) or contains an infinite hyperfinite subproperty. A simple consequence of our result is that no infinite graph property that consists only of expander graphs is constant-query testable. Based on the above findings, one could ask whether every infinite testable non-hyperfinite property might contain an infinite family of expander (or near-expander) graphs. We show that this is not true. Motivated by our counterexample, we develop a theorem showing that the vertex set of every bounded-degree graph can be partitioned into a constant number of subsets and a separator set, such that the separator set is small and the distribution of k-discs on every subset of a partition class is roughly the same as that of the partition class if the subset has small expansion.

    BETULA: Numerically Stable CF-Trees for BIRCH Clustering

    BIRCH is a widely known clustering approach that has influenced much subsequent research and many commercial products. The key contribution of BIRCH is the Clustering Feature tree (CF-Tree), which is a compressed representation of the input data. As new data arrives, the tree is eventually rebuilt to increase the compression. Afterward, the leaves of the tree are used for clustering. Because of the data compression, this method is very scalable. The idea has been adopted, for example, for k-means, data stream, and density-based clustering. Clustering features used by BIRCH are simple summary statistics that can easily be updated with new data: the number of points, the linear sums, and the sum of squared values. Unfortunately, the way the sum of squares is then used in BIRCH is prone to catastrophic cancellation. We introduce a replacement cluster feature that does not have this numeric problem, that is not much more expensive to maintain, and that makes many computations simpler and hence more efficient. These cluster features can also easily be used in other work derived from BIRCH, such as algorithms for streaming data. In the experiments, we demonstrate the numerical problem and compare the performance of the original algorithm with that of the improved cluster features.
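The cancellation problem can be reproduced in a few lines. The sketch below contrasts the classic summary (count, linear sum, sum of squares) with a summary that stores the count, the mean, and the sum of squared deviations, updated with Welford-style formulas. It is one-dimensional for brevity, and the exact form of the replacement feature in the paper is assumed rather than quoted; the names `variance_classic` and `StableCF` are hypothetical.

```python
def variance_classic(points):
    """Variance from (n, linear sum, sum of squares), as in the classic
    clustering feature. Subtracting two huge, nearly equal numbers
    loses almost all significant digits."""
    n, ls, ss = 0, 0.0, 0.0
    for x in points:
        n += 1
        ls += x
        ss += x * x
    return ss / n - (ls / n) ** 2


class StableCF:
    """Summary (n, mean, s) where s is the sum of squared deviations
    from the mean. Updates never subtract large squared values."""

    def __init__(self):
        self.n, self.mean, self.s = 0, 0.0, 0.0

    def add(self, x):
        # Welford's online update.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.s += delta * (x - self.mean)

    def merge(self, other):
        # Combine two summaries (pairwise update of mean and s).
        n = self.n + other.n
        if n == 0:
            return
        delta = other.mean - self.mean
        self.s += other.s + delta * delta * self.n * other.n / n
        self.mean += delta * other.n / n
        self.n = n

    def variance(self):
        return self.s / self.n if self.n else 0.0


# Four points with true (population) variance 22.5, shifted far from zero.
data = [1e9 + x for x in (4.0, 7.0, 13.0, 16.0)]
cf = StableCF()
for x in data:
    cf.add(x)
print(variance_classic(data), cf.variance())
```

On this input the stable summary recovers the variance, while the classic formula returns a value dominated by rounding error, because the sum of squares is around 4e18 and no longer fits the 53-bit mantissa of a double.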

    confstream: automated algorithm selection and configuration of stream clustering algorithms

    Machine learning has become one of the most important tools in data analysis. However, selecting the most appropriate machine learning algorithm and tuning its hyperparameters to their optimal values remains a difficult task. This is even more difficult for streaming applications, where automated approaches are often not available to help during algorithm selection and configuration. This paper proposes the first approach for automated algorithm selection and configuration of stream clustering algorithms. We train an ensemble of different stream clustering algorithms and configurations in parallel and use the best-performing configuration to obtain a clustering solution. By drawing new configurations from better-performing ones, we are able to improve the ensemble performance over time. In large experiments on real and artificial data, we show how our ensemble approach can improve upon default configurations and can also compete with a-posteriori algorithm configuration. Our approach is considerably faster than a-posteriori approaches and applicable in real time. In addition, it is not limited to stream clustering and can be generalised to all streaming applications, including stream classification and regression.

    PMLR

    We study fine-grained error bounds for differentially private algorithms for counting under continual observation. Our main insight is that the matrix mechanism, when using lower-triangular matrices, can be used in the continual observation model. More specifically, we give an explicit factorization for the counting matrix M_count and upper bound the error explicitly. We also give a fine-grained analysis, specifying the exact constant in the upper bound. Our analysis is based on upper and lower bounds on the completely bounded norm (cb-norm) of M_count. Along the way, we improve the best-known bound, which had stood for 28 years, by Mathias (SIAM Journal on Matrix Analysis and Applications, 1993) on the cb-norm of M_count for a large range of the dimension of M_count. Furthermore, we are the first to give concrete error bounds for various problems under continual observation such as binary counting, maintaining a histogram, releasing an approximately cut-preserving synthetic graph, many graph-based statistics, and substring and episode counting. Finally, we note that our result can be used to get a fine-grained error bound for non-interactive local learning and the first lower bounds on the additive error for (ϵ,δ)-differentially-private counting under continual observation. Subsequent to this work, Henzinger et al. (SODA, 2023) showed that our factorization also achieves fine-grained mean-squared error.
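The abstract does not spell out the factorization, so the following Python sketch uses one natural lower-triangular choice as an assumption: the matrix square root of M_count, taken here to be the T×T lower-triangular all-ones matrix that maps a stream to its prefix sums. Its square root is the lower-triangular Toeplitz matrix whose entries are the power-series coefficients of (1 − x)^(−1/2). Whether this matches the paper's factorization exactly is not confirmed by the text above. The function names are hypothetical, and calibrating `sigma` to a given (ϵ,δ) is deliberately left out.

```python
import random


def sqrt_coefficients(T):
    """First T power-series coefficients of (1 - x)^(-1/2):
    f(0) = 1 and f(k) = f(k-1) * (2k - 1) / (2k)."""
    f = [1.0]
    for k in range(1, T):
        f.append(f[-1] * (2 * k - 1) / (2 * k))
    return f


def toeplitz_lower(f):
    """Lower-triangular Toeplitz matrix with entry f[i - j] at (i, j)."""
    T = len(f)
    return [[f[i - j] if i >= j else 0.0 for j in range(T)] for i in range(T)]


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def noisy_prefix_sums(x, sigma, rng):
    """Matrix mechanism with M_count = C @ C: release C (C x + z), where z
    is i.i.d. Gaussian noise with standard deviation sigma. Because C is
    lower triangular, output t depends only on x[0..t], as continual
    observation requires. Choosing sigma for a privacy target is not shown."""
    C = toeplitz_lower(sqrt_coefficients(len(x)))
    inner = [v + rng.gauss(0.0, sigma) for v in matvec(C, x)]
    return matvec(C, inner)


stream = [1, 0, 1, 1, 0, 1]
print(noisy_prefix_sums(stream, sigma=1.0, rng=random.Random(0)))
```

With `sigma=0` the function returns the exact prefix sums, which is a quick way to check that the two factors multiply back to the counting matrix.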

    Fair Coresets and Streaming Algorithms for Fair k-means

    We study fair clustering problems as proposed by Chierichetti et al. [CKLV17]. Here, points have a sensitive attribute and all clusters in the solution are required to be balanced with respect to it (to counteract any form of data-inherent bias). Previous algorithms for fair clustering do not scale well. We show how to model and compute so-called coresets for fair clustering problems, which can be used to significantly reduce the input data size. We prove that the coresets are composable [IMMM14] and show how to compute them in a streaming setting. This yields a streaming PTAS for fair k-means in the case of two colors (and exact balances). Furthermore, we extend techniques due to Chierichetti et al. [CKLV17] to obtain an approximation algorithm for k-means, which leads to a constant-factor algorithm in the streaming model when combined with the coreset.

    Theoretical Analysis of the k-Means Algorithm – A Survey

    The k-means algorithm is one of the most widely used clustering heuristics. Despite its simplicity, analyzing its running time and quality of approximation is surprisingly difficult and can lead to deep insights that can be used to improve the algorithm. In this paper we survey the recent results in this direction as well as several extensions of the basic k-means method.
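For reference, the basic heuristic that such analyses usually refer to is Lloyd's iteration: assign every point to its nearest center, then move every center to the mean of its assigned points, and repeat until nothing changes. The Python sketch below is a minimal version under that assumption. The function name `lloyd` is hypothetical, and the initial centers are passed in explicitly because the choice of initialization is a separate design question.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def lloyd(points, centers, max_iter=100):
    """Lloyd's iteration for k-means, starting from the given centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            clusters[j].append(p)
        # Update step: each center moves to the mean of its cluster;
        # a center with an empty cluster stays where it is.
        new_centers = [mean(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers


pts = [(0, 0), (0, 2), (10, 0), (10, 2)]
print(lloyd(pts, centers=[(0, 0), (10, 0)]))
```

Each iteration cannot increase the sum of squared distances to the nearest center, which is why the loop terminates, but the result is only a local optimum and depends on the starting centers.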